NLP4NLP: Applying NLP to Scientific Corpora about Written and Spoken Language Processing
نویسندگان
چکیده
Analyzing the evolutions of the trends of a scientific domain in order to provide insights on its states and to establish reliable hypotheses about its future is the problem we address here. We have approached the problem by processing both the metadata and the text contents of the domain publications. Ideally, one would like to be able to automatically synthesize all the information present in the documents and their metadata. As members of the NLP community, we have applied the tools developed by our community to publications from our own domain, in what could be termed a “recursive” approach. In a first step, we have assembled a corpus of papers from NLP conferences and journals for both text and speech, covering documents produced from the 60’s up to 2015. Then , we have mined our scientific publication database to draw a picture of our field from quantitative and qualitative results according to a wide range of perspectives: ranging from sub-domains, specific communities, chronology, terminology, conceptual evolution, re-use and plagiarism, trend prediction, novelty detection and many more. We provide here an account of the corpus collection and of its processing with NLP technology, indicating for each aspect which technology was used. We conclude on the benefits brought by such corpus to the actors of the domain and on the conditions to generalize this approach to other scientific domains. Conference Topics Methods and techniques, Citation and co-citation analysis, Scientific fraud and dishonesty, Natural Language Processing
منابع مشابه
Robust Extraction of Subcategorization Data from Spoken Language
Subcategorization data has been crucial for various NLP tasks. Current method for automatic SCF acquisition usually proceeds in two steps: first, generate all SCF cues from a corpus using a parser, and then filter out spurious SCF cues with statistical tests. Previous studies on SCF acquisition have worked mainly with written texts; spoken corpora have received little attention. Transcripts of ...
متن کاملConcordancing for parallel spoken language corpora
Concordancing is one of the oldest corpus analysis tools, especially for written corpora. In NLP concordancing appears in training of speech-recognition system. Additionally, comparative studies of different languages result in parallel corpora. Concordancing for these corpora in a NLP context is a new approach. We propose to combine these fields of interest for a multi-purpose concordance for ...
متن کاملParsing Arabic Dialects
The Arabic language is a collection of spoken dialects with important phonological, morphological, lexical, and syntactic differences, along with a standard written language, Modern Standard Arabic (MSA). Since the spoken dialects are not officially written, it is very costly to obtain adequate corpora to use for training dialect NLP tools such as parsers. In this paper, we address the problem ...
متن کاملCross-Domain and Cross-Language Porting of Shallow Parsing
English was the main focus of attention of the Natural Language Processing (NLP) community for years. As a result, there are significantly more annotated linguistic resources in English than in any other language. Consequently, data-driven tools for automatic text or speech processing are developed mainly for English. Developing similar corpora and tools for other languages is an important issu...
متن کاملInstant Annotations – Applying NLP Methods to the Annotation of Spoken Language Documentation Corpora
Thepaper describes work-in-progress by the Pite Saami, Kola Saami and Izhva Komi language documentation projects, all of which use similar data and technical frameworks and are carried out in Freiburg and in collaboration with Hamburg, Syktyvkar, Tromsø and Uppsala. Our projects work in the endangered language documentation framework and record new spoken language data, digitize available recor...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2015